Discovering Objects in Dynamically-Generated Web Pages

Authors

  • James Caverlee
  • David Buttler
  • Ling Liu
Abstract

As the web grows, more and more content is being hidden from the reach of traditional search engines. In this paper, we present THOR, a scalable and efficient tool to mine objects from this hidden web. With precision and recall over 90%, THOR automatically extracts objects of interest from dynamically-generated web pages. Then customized object-identification algorithms are applied to locate the “interesting” objects in each page. We show that dynamically-generated pages tend to be a homogeneous subset of pages found on the Web, and that these pages may be separated into distinct clusters of structurally-similar pages. Using this homogeneity across clusters along with traditional information retrieval techniques, we propose a two-phase clustering scheme consisting of a page clustering algorithm and a fragment clustering algorithm. Using this scheme, we can identify object-rich fragments of each page with an average of over 90% precision and over 95% recall.


Similar resources

Inférer des Objets Sémantiques du Web Structuré

This thesis focuses on the extraction and analysis of Web data objects, investigated from different points of view: temporal, structural, semantic. We first survey different strategies and best practices for deriving temporal aspects of Web pages, together with a more in-depth study on Web feeds for this particular purpose, and other statistics. Next, in the context of dynamically-generated Web...


Analysis of navigation behaviour in web sites integrating multiple information

The analysis of web usage has mostly focused on sites composed of conventional static pages. However, huge amounts of information available in the web come from databases or other data collections and are presented to the users in the form of dynamically generated pages. The query interfaces of such sites allow the specification of many search criteria. Their generated results support navigatio...


Adaptive Web Prefetching Scheme using Link Anchor Information

Web prefetching provides an effective mechanism to mitigate the user-perceived latency when accessing web pages. The content of web pages provides useful information for generating the predictions, which are used to prefetch the web objects for satisfying the user's future requests. In this paper, we propose a fuzzy-logic-based web prefetching scheme that generates effective predictions for pr...


Discovering Test Set Regularities in Relational Domains

Machine learning typically involves discovering regularities in a training set, then applying these learned regularities to classify objects in a test set. In this paper we present an approach to discovering additional regularities in the test set, and show that in relational domains such test set regularities can be used to improve classification accuracy beyond that achieved using the trainin...


A Categorical Model for Discovering Latent Structure in Social Annotations

The advent of social tagging systems has enabled a new community-based view of the Web in which objects like images, videos, and Web pages are annotated by thousands of users. Understanding the emergent semantics inherent in the socially-generated collection of annotations has important research implications for information discovery and knowledge sharing. To this end, we propose a novel probab...




Publication date: 2003